This is the first installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each notebook, I apply one method taught in the course to an open Kaggle competition.
In this notebook, I demonstrate linear regression using the Bike Sharing Demand competition, which is based on Washington, D.C.'s bike share program.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import code.Linear_Regression_Funcs as LRF  # helper functions defined in this repository
A description of the data can be found here: https://www.kaggle.com/c/bike-sharing-demand/data
In [2]:
train = pd.read_csv("./data/bike_sharing_demand/train.csv",
                    index_col='datetime', parse_dates=['datetime'])
train.head()
Out[2]: (first five rows of the training data; table not shown)
In [3]:
y = train['count']
In [4]:
# Get dictionaries to hold scale factors for month and weather type
scales = LRF.get_scale(train,['weather','monthly'])
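LRF.get_scale is defined in the companion Linear_Regression_Funcs module, which isn't shown in this notebook. As a rough illustration of the idea (my own sketch, not the module's actual code), a scale factor can be the mean ridership within each category relative to the overall mean:
# Hypothetical sketch of a scale-factor helper, not the actual LRF.get_scale.
def get_scale_sketch(df):
    # Scale factor = mean ridership in a category / overall mean ridership
    overall = df['count'].mean()
    weather = (df.groupby('weather')['count'].mean() / overall).to_dict()
    monthly = (df.groupby(df.index.month)['count'].mean() / overall).to_dict()
    return {'weather': weather, 'monthly': monthly}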
In [5]:
# X is an [m x n] matrix.
# m = number of observations
# n = number of predictors
X = LRF.make_matrix(train, weather_scale=scales['weather'], monthly_scale=scales['monthly'])
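Like get_scale, make_matrix lives in the companion module. Here is a sketch of the kind of matrix it might build, again my own illustration rather than the actual code: hourly dummy variables to capture the commuting cycle, plus the weather and monthly scale factors as numeric columns.
# Hypothetical sketch of a model-matrix builder, not the actual LRF.make_matrix.
def make_matrix_sketch(df, weather_scale, monthly_scale):
    # Dummy variables for hour of day capture the daily commuting cycle
    X = pd.get_dummies(df.index.hour, prefix='hour')
    X.index = df.index
    # Map each observation's weather code and month to its scale factor
    X['weather'] = df['weather'].map(weather_scale)
    X['monthly'] = pd.Series(df.index.month, index=df.index).map(monthly_scale)
    # Add an intercept column
    return sm.add_constant(X)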
Ordinary Least Squares (OLS) assumes a squared loss function with no weights and no regularization.
a. sm.OLS.fit() solves the normal equations to estimate the model parameters.
b. sm.OLS.fit_regularized() adds an elastic-net penalty and minimizes the resulting cost function numerically using coordinate descent.
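As a sanity check on what fit() does under the hood, the normal equations can be solved directly with NumPy (assuming X has full column rank); the estimates should match results.params from the next cell up to numerical precision:
# Solve the normal equations (X'X) beta = X'y directly.
# statsmodels solves the same problem more robustly via a pseudoinverse.
Xd = np.asarray(X, dtype=float)
yd = np.asarray(y, dtype=float)
beta = np.linalg.solve(Xd.T.dot(Xd), Xd.T.dot(yd))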
In [6]:
results = sm.OLS(y,X).fit()
In [7]:
# Print OLS regression results and diagnostics
results.summary()
Out[7]: (OLS regression results table not shown)
In [8]:
# Print score on Kaggle training data
ypredict = results.predict(X)
ypredict[ypredict<0]=0 # Make sure we don't predict negative ridership!
print "score on training data = ", LRF.score(y,ypredict)
In [9]:
# View time series of ridership observations and predictions
start = 2700
end = start + 300
#start = 0; end = len(train)-1 # Uncomment to see entire timeseries
plt.plot(train.index[start:end],y[start:end],'-r',alpha=1,lw=3)
plt.plot(train.index[start:end],ypredict[start:end],'-b',alpha=0.3,lw=3)
Out[9]: (time series plot of observed vs. predicted ridership not shown)
In [10]:
# Read test data
test = pd.read_csv("./data/bike_sharing_demand/test.csv",
                   index_col='datetime', parse_dates=['datetime'])
In [11]:
# Construct test model matrix
Xtest = LRF.make_matrix(test, weather_scale=scales['weather'], monthly_scale=scales['monthly'])
In [12]:
# Calculate predictions by applying model parameters to test model matrix
Ypredict = pd.DataFrame(results.predict(Xtest),index=Xtest.index)
Ypredict = Ypredict.apply(np.round)
Ypredict[Ypredict<0]=0
In [13]:
# Write to csv
Ypredict.columns = ['count']
Ypredict = Ypredict.astype(int) # Force integers in output
Ypredict.to_csv('./predictions/Linear_Regression_Prediction.csv',sep=',')
This submission received a score of 0.57986, placing 902nd out of 1481 submissions. Not bad for a quick linear regression model!
The evaluation metric uses the log of ridership, so it is especially important to be accurate when ridership is low. Some ideas for improving the score, with sketches of a few of them after this list:
Implement weighted least squares, sm.WLS(y, X, weights=...), in which observations with smaller values are weighted more heavily.
Include more predictor variables. For example, transform the raw weather data: use the monthly temperature anomaly rather than the raw temperature, and add an interaction term between temperature anomaly and month of the year (or day of the week).
Regularize the model. It performs worse on the test data than on the training data, which suggests overfitting. Try early stopping, or add a penalty term using OLS.fit_regularized().
Fine-tune the predictions. For example, manually reduce ridership predictions during the last week of December, when many people are out of town or not commuting to work.
Assume that the endogenous variable follows a Poisson distribution rather than a normal distribution: sm.GLM(y, X, family=sm.families.Poisson()). This is cheating a little bit, because this tutorial is about linear regression.
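For reference, here are minimal sketches of three of these ideas. The statsmodels calls are real, but the weighting scheme and penalty strength are illustrative guesses, not tuned values.
# 1. Weighted least squares: upweight low-ridership hours, since the
#    log-based metric punishes errors there the most (illustrative weights).
wls_results = sm.WLS(y, X, weights=1.0 / (y + 1.0)).fit()

# 2. Elastic-net regularized OLS: shrink coefficients to reduce overfitting
#    (alpha and L1_wt chosen arbitrarily here).
reg_results = sm.OLS(y, X).fit_regularized(alpha=0.1, L1_wt=0.5)

# 3. Poisson GLM: model hourly counts directly instead of assuming
#    normally distributed errors.
pois_results = sm.GLM(y, X, family=sm.families.Poisson()).fit()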